Abstract: When there are multiple node failures in a distributed
storage system, regenerating the failed storage nodes
individually in a one-by-one manner is suboptimal as far as
repair-bandwidth minimization is concerned. If data exchange
among the newcomers is enabled, we can get a better tradeoff
between repair bandwidth and the storage per node. An explicit
and optimal construction of cooperative regenerating code is
illustrated.

Index Terms: Distributed Storage, Repair Bandwidth,
Regenerating Codes, Erasure Codes, Network Coding.

I. Introduction 

A distributed storage system provides a scalable solution to
the ever-increasing demand for reliable storage. The storage
nodes are distributed in different geographical locations, so
that if a disastrous event happens to one of them, the
source data remains intact. There are two common
strategies for preventing data loss against storage node failures.
The first one, employed by the current Google file system [1],
is data replication. Although a replication-based scheme is easy
to manage, it has the drawback of low storage efficiency.
The second one is based on erasure coding, and is used in
OceanStore [2] and Total Recall [3], for instance. With erasure
coding, the storage network can be regarded as an erasure
code which can correct any $n - k$ erasures; a file is encoded
into n pieces of data, and from any k of them the original file
can be reconstructed.

When a storage node fails, an obvious way to repair it is
to rebuild the whole file from some other k nodes, and then
re-encode the data. The disadvantage of this method is that,
when the file size is very large, excessive traffic is generated in
the network. The bandwidth required in the repair process
seems to be wasted, because only a fraction of the downloaded
data is stored in the new node after regeneration. By viewing
the repair problem as a single-source multicast problem in
network coding theory, Dimakis et al. discovered a tradeoff
between the amount of storage in each node and the bandwidth
required in the repair process [4]. Erasure codes for distributed
storage systems that aim at minimizing the repair-bandwidth
are called regenerating codes. The construction of regenerating
codes is under active research. We refer the readers to [5] and
the references therein for the application of network coding in
distributed storage systems.

Most of the results in the literature on regenerating codes 
are for repairing a single storage node. However, there are 
several scenarios where multiple failures must be considered.
Firstly, in a system with high churn rate, the nodes may join
and leave the system very frequently. When two or more nodes
join the distributed storage system at the same time, the new
nodes can exploit the opportunity of exchanging data among
themselves in the repair process. Secondly, node repair may
be done in batches. In systems like Total Recall, a recovery
is triggered when the fraction of available nodes falls below a
certain threshold, and the failed nodes are then repaired as a
group. The new nodes which are going to be regenerated are
called newcomers. There are two ways of regenerating a group
of newcomers: we may either repair them one by one, or repair
them jointly with cooperation among the newcomers. It is
shown in [6], [7] that further reduction of repair-bandwidth is
possible with cooperative repair. Let the number of newcomers
be r. In [6], each newcomer is required to connect to all $n - r$
surviving storage nodes during the repair process, and in [7]
this requirement is relaxed such that different newcomers may
have different numbers of connections. However, in both [6]
and [7], only the storage systems which minimize storage per
node are considered.

In this paper, an example of cooperatively regenerating
multiple newcomers is described in Section II. In Section III,
we define the information flow graph for cooperative repair,
and derive a lower bound on repair-bandwidth. This lower
bound is applicable to functional repair, where the content of
a newcomer may not be the same as the failed node to be
replaced, but the property that any k nodes can reconstruct
the original file is retained. The lower bound is a function
of the storage per node, and hence is an extension of the
results in [6]. A more practical and easier-to-manage mode
of operation is called exact repair, in which the regenerated
node contains exactly the same encoded data as the failed
node. In Section IV, we give a family of explicit code
constructions which meet the lower bound, and hence show
that the construction is optimal.

II. An Example of Cooperative Repair 

Consider the following example taken from [8]. Four data
packets $A_1$, $A_2$, $B_1$ and $B_2$ are distributed to four storage
nodes, each of which stores two packets. The first node stores
$A_1$ and $A_2$, and the second stores $B_1$ and $B_2$. The third and fourth
nodes are parity nodes. The third node contains the two packets
$A_1 + B_1$ and $2A_2 + B_2$, and the last node contains $2A_1 + B_1$
and $A_2 + B_2$. Here, a packet is interpreted as an element of
a finite field, and addition and multiplication are finite field
operations. We can take GF(5) as the underlying finite field
in this example. Any data collector, after downloading the
packets from any two storage nodes, can reconstruct the four
original packets by solving a system of linear equations. For
example, if we download from the third and fourth nodes, we
can recover $A_1$ and $B_1$ from packets $A_1 + B_1$ and $2A_1 + B_1$,
and recover $A_2$ and $B_2$ from packets $2A_2 + B_2$ and $A_2 + B_2$.

Fig. 1. Repairing a single node failure with minimum repair bandwidth

Fig. 2. Individual regeneration of multiple failures

Suppose that the first node fails. To repair the first node, we
can download four packets from any other two nodes, from
which we can recover the two required packets $A_1$ and $A_2$.
For example, if we download the packets from the second and
third nodes, we have $B_1$, $B_2$, $A_1 + B_1$ and $2A_2 + B_2$. We
can then recover $A_1$ by subtracting $B_1$ from $A_1 + B_1$, and
$A_2$ by computing $((2A_2 + B_2) - B_2)/2$. It is illustrated in [8]
that we can reduce the repair-bandwidth from four packets
to three packets, by making three connections to the three
remaining nodes, and downloading one packet from each of
them (Fig. 1). Each of the three remaining nodes simply adds
its two packets and sends the sum to the newcomer, who can
then subtract off $B_1 + B_2$ and obtain $A_1 + 2A_2$ and $2A_1 + A_2$,
from which $A_1$ and $A_2$ can be solved.
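
The three-packet repair described above can be checked numerically.
The following is a minimal sketch in Python, assuming GF(5) arithmetic
implemented as integers modulo 5; the function names `inv` and
`repair_node1` are illustrative and do not come from [8].

```python
P = 5  # field size used in the example (prime, so integers mod P form a field)

def inv(x):
    """Multiplicative inverse in GF(P) via Fermat's little theorem."""
    return pow(x, P - 2, P)

def repair_node1(A1, A2, B1, B2):
    """Rebuild (A1, A2) from one packet sent by each of nodes 2, 3 and 4."""
    # Contents of the three surviving nodes.
    node2 = (B1 % P, B2 % P)
    node3 = ((A1 + B1) % P, (2 * A2 + B2) % P)
    node4 = ((2 * A1 + B1) % P, (A2 + B2) % P)
    # Each helper adds its two packets and sends the sum: three packets in total.
    s2, s3, s4 = (sum(node) % P for node in (node2, node3, node4))
    # Subtracting B1 + B2 leaves u = A1 + 2*A2 and v = 2*A1 + A2.
    u = (s3 - s2) % P
    v = (s4 - s2) % P
    # Solve the 2x2 system: 2v - u = 3*A1 and 2u - v = 3*A2.
    A1_rec = (2 * v - u) * inv(3) % P
    A2_rec = (2 * u - v) * inv(3) % P
    return A1_rec, A2_rec
```

The 2x2 system is solvable because its determinant, $1 \cdot 1 - 2 \cdot 2 = -3$,
is non-zero in GF(5).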

When two storage nodes fail simultaneously, the computational
trick mentioned in the previous paragraph no longer
works. Suppose that the second and the fourth storage nodes
fail at the same time. To repair both of them separately, each of
the newcomers can download four packets from the remaining
storage nodes, reconstruct packets $A_1$, $A_2$, $B_1$ and $B_2$, and
re-encode the desired packets (Fig. 2). This is the best we
can do with separate repair. Using the result in [4], it can
be shown that any one-by-one repair process with repair-bandwidth
strictly less than four packets per newcomer is
infeasible.

If the two newcomers can exchange data during the regeneration
process, the total repair-bandwidth can indeed be reduced
from eight packets to six packets (Fig. 3). The two newcomers
first make an agreement that one of them downloads the
packets with subscript 1, and the other one downloads the
packets with subscript 2. (They can compare, for instance,
their serial numbers in order to determine who downloads
the packets with the smaller subscript.) The first newcomer gets
$A_1$ and $A_1 + B_1$ from nodes 1 and 3 respectively, while the
second newcomer gets $A_2$ and $2A_2 + B_2$ from nodes 1 and
3 respectively. The first newcomer then computes $B_1$ and
$2A_1 + B_1$ by taking the difference and the sum of the two
inputs. The packet $B_1$ is stored in the first newcomer and
$2A_1 + B_1$ is sent to the second newcomer. Similarly, the second
newcomer computes $B_2$ and $A_2 + B_2$, stores $A_2 + B_2$ in
memory and sends $B_2$ to the first newcomer. Only six packet
transmissions are required in this joint regeneration process.

Fig. 3. Cooperative regeneration of multiple failures
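
The joint regeneration of nodes 2 and 4 can be traced step by step.
The sketch below is an illustration only, again assuming GF(5) as
integers modulo 5; the function name `cooperative_repair` and the
packet counter `sent` are hypothetical names introduced here.

```python
P = 5  # field size used in the example

def cooperative_repair(A1, A2, B1, B2):
    """Jointly rebuild nodes 2 and 4 from nodes 1 and 3.

    Returns (new_node2, new_node4, packets_sent).
    """
    node1 = (A1 % P, A2 % P)
    node3 = ((A1 + B1) % P, (2 * A2 + B2) % P)
    sent = 0

    # Phase 1: the first newcomer fetches the subscript-1 packets,
    # the second newcomer fetches the subscript-2 packets.
    x1, y1 = node1[0], node3[0]      # A1 and A1 + B1
    x2, y2 = node1[1], node3[1]      # A2 and 2*A2 + B2
    sent += 4

    # Local processing at the first newcomer: difference and sum.
    b1 = (y1 - x1) % P               # B1, kept by the first newcomer
    for_second = (y1 + x1) % P       # 2*A1 + B1, destined for the second
    # Local processing at the second newcomer.
    b2 = (y2 - 2 * x2) % P           # B2, destined for the first newcomer
    kept_second = (y2 - x2) % P      # A2 + B2, kept by the second newcomer

    # Phase 2: one packet travels in each direction.
    sent += 2

    new_node2 = (b1, b2)
    new_node4 = (for_second, kept_second)
    return new_node2, new_node4, sent
```

For the packets (1, 2, 3, 4), this rebuilds node 2 as (3, 4) and node 4 as
(0, 1) using six transmissions.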

III. Information Flow Graph and Min-Cut Bound 

We formally define the cooperative repair problem as follows.
There are two kinds of entities in a distributed storage
system, storage nodes and data collectors, and two kinds
of operations, file reconstruction and node repair. A file of
size B units is encoded and distributed among the n storage
nodes, each of which stores $\alpha$ units of data. The file can be
reconstructed by a data collector connecting to any k storage
nodes. Upon the failure of r nodes, a two-phase repair process
is triggered. In the first phase, each of the r newcomers
connects to d remaining storage nodes, and downloads $\beta_1$
units of data from each of them. After processing the data
they have downloaded, the r newcomers exchange some data
among themselves, each sending $\beta_2$ units of data to each of
the other $r - 1$ newcomers. Each newcomer thus downloads $d\beta_1$
units of data in the first phase and $(r - 1)\beta_2$ units of data
in the second phase. The repair-bandwidth per node is thus

$$\gamma = d\beta_1 + (r - 1)\beta_2.$$

In the remainder of this paper, we assume that $d \ge k$.

We construct an information flow graph as follows. There
are three types of vertices in the information flow graph: one
for the source data, one for the storage nodes and one for data
collectors. The vertices are divided into stages. We proceed
from one stage to the next stage after a repair process is
completed (Fig. 4).

Fig. 4. Information flow graph

There is a single vertex, called the source vertex, in stage
$-1$, representing the original data file. The n storage nodes
are represented by n vertices in stage 0, called $\mathrm{Out}_i$,
for $i = 1, 2, \ldots, n$. The source vertex is connected to each vertex in
stage 0 by a directed edge with capacity $\alpha$. For $s = 1, 2, 3, \ldots$,
let $\mathcal{R}_s$ be the set of r storage nodes which fail in stage $s - 1$
and are regenerated in stage s. The set $\mathcal{R}_s$ is a subset of
$\{1, 2, \ldots, n\}$ with cardinality r. For each storage node p in
$\mathcal{R}_s$, we construct three vertices in stage s: $\mathrm{In}_p$, $\mathrm{Mid}_p$ and $\mathrm{Out}_p$.
Vertex $\mathrm{In}_p$ has d incoming edges with capacity $\beta_1$, emanating
from d "out" nodes in previous stages. We join vertex $\mathrm{In}_p$ and
$\mathrm{Mid}_p$ with a directed edge of infinite capacity. For $p, q \in \mathcal{R}_s$,
$p \ne q$, there is a directed edge from $\mathrm{In}_p$ to $\mathrm{Mid}_q$ with capacity
$\beta_2$. Newcomer p stores $\alpha$ units of data, and this is represented
by a directed edge from $\mathrm{Mid}_p$ to $\mathrm{Out}_p$ with capacity $\alpha$.

For each data collector, we add a vertex, called DC, in the 
information flow graph. It is connected to k "out" nodes with 
distinct indices, but not necessarily from the same stage, by k 
infinite-capacity edges. 
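
To make the graph construction concrete, the sketch below builds a
one-stage information flow graph for the Section II example
(n = 4, k = d = r = 2, capacities 2, 1 and 1 for storage, download and
exchange) and computes the maximum flow to a data collector. The vertex
labels, the helper names `add_edge`, `max_flow` and `build_example`, the
large constant standing in for infinite capacity, and the use of a plain
Edmonds-Karp search are all choices made for this illustration.

```python
from collections import deque

INF = 10**9  # stands in for an infinite edge capacity

def add_edge(cap, u, v, c):
    """Add a directed edge u -> v with capacity c, plus a zero residual edge."""
    cap.setdefault(u, {})
    cap.setdefault(v, {})
    cap[u][v] = cap[u].get(v, 0) + c
    cap[v].setdefault(u, 0)

def max_flow(cap, s, t):
    """Edmonds-Karp maximum flow; modifies the residual capacities in place."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        # Find the bottleneck along the augmenting path, then push flow.
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

def build_example(dc_nodes, beta2=1):
    """Nodes 2 and 4 fail in stage 0 and are rebuilt in stage 1 from nodes 1 and 3.

    dc_nodes lists the (node index, stage) pairs the data collector reads.
    """
    alpha, beta1 = 2, 1
    cap = {}
    for i in (1, 2, 3, 4):
        add_edge(cap, "S", ("Out", i, 0), alpha)
    failed, helpers = (2, 4), (1, 3)
    for p in failed:
        for h in helpers:
            add_edge(cap, ("Out", h, 0), ("In", p, 1), beta1)
        add_edge(cap, ("In", p, 1), ("Mid", p, 1), INF)
        add_edge(cap, ("Mid", p, 1), ("Out", p, 1), alpha)
        for q in failed:
            if q != p:
                add_edge(cap, ("In", p, 1), ("Mid", q, 1), beta2)
    for i, stage in dc_nodes:
        add_edge(cap, ("Out", i, stage), "DC", INF)
    return cap
```

With the exchange capacity set to 1, a collector reading node 1 and the
regenerated node 2 still receives a flow of 4, the file size. Setting
`beta2=0` in this sketch (no exchange between newcomers) lowers that flow
to 3, so the file could no longer be recovered.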

We call an information flow graph constructed in this way
$G(n, k, r; \alpha, \beta_1, \beta_2)$, or simply G if the parameters are
clear from the context. The number of stages is potentially
unlimited.

A cut in an information flow graph is a partition of the set
of vertices, $(\mathcal{U}, \bar{\mathcal{U}})$, such that the source vertex is in $\mathcal{U}$ and a
designated data collector is in $\bar{\mathcal{U}}$. We associate with each cut a
value, called the capacity, defined as the sum of the capacities
of the directed edges from vertices in $\mathcal{U}$ to vertices in $\bar{\mathcal{U}}$. An
example is shown in Fig. 5. The max-flow-min-cut bound in
network coding for a single-source multicast network states that
if the minimum cut capacity between the data collectors and the
source is no larger than C, then the amount of data we can
send to each data collector is no more than C [9].

Theorem 1. Suppose that $d \ge k$. The minimum cut of an
information flow graph G is less than or equal to

$$\sum_{i=1}^{k} \ell_i \min\Big\{\alpha,\ \Big(d - \sum_{j=1}^{i-1} \ell_j\Big)\beta_1 + (r - \ell_i)\beta_2\Big\} \qquad (1)$$

where $(\ell_1, \ell_2, \ldots, \ell_k)$ is any k-tuple of integers satisfying
$\ell_1 + \ell_2 + \cdots + \ell_k = k$ and $0 \le \ell_i \le r$ for all i.

Fig. 5. A sample cut in the information flow graph.

Fig. 6. Two different kinds of cuts within a stage.

Proof: By relabeling the nodes if necessary, suppose that
a data collector DC connects to storage nodes 1 to k. Let
$s_1 < s_2 < \cdots < s_m$ be the stages in which nodes 1 to k are
most recently repaired, where m is an integer. We note that
$\{1, 2, \ldots, k\}$ is contained in the union of $\mathcal{R}_{s_1}, \mathcal{R}_{s_2}, \ldots, \mathcal{R}_{s_m}$.
For $i = 1, 2, \ldots, m$, let

$$\mathcal{S}_i := \big(\{1, 2, \ldots, k\} \cap \mathcal{R}_{s_i}\big) \setminus \big(\mathcal{R}_{s_{i+1}} \cup \cdots \cup \mathcal{R}_{s_m}\big).$$

The physical meaning of $\mathcal{S}_i$ is that the storage nodes with
indices in $\mathcal{S}_i$ are repaired in stage $s_i$ and remain intact until
the data collector DC shows up. The index sets $\mathcal{S}_i$ are disjoint
and their union is equal to $\{1, 2, \ldots, k\}$. We let $\ell_i$ be the
cardinality of $\mathcal{S}_i$. Obviously we have $\ell_1 + \ell_2 + \cdots + \ell_m = k$,
$\ell_i \le r$ for all i, and $m \le k$.

For $i = 1, 2, \ldots, m$, the $\ell_i$ "out" nodes in stage $s_i$ which
are connected directly to DC must be in $\bar{\mathcal{U}}$; otherwise, there
would be an infinite-capacity edge from $\mathcal{U}$ to $\bar{\mathcal{U}}$. In stage $s_i$,
we consider two different ways to construct a cut. We either
put all "in" and "mid" nodes associated with the storage nodes
in $\mathcal{S}_i$ in $\bar{\mathcal{U}}$, or put all of them in $\mathcal{U}$. In Fig. 6, we graphically
illustrate the two different cuttings. The shaded vertices are in
$\bar{\mathcal{U}}$ and the edges from $\mathcal{U}$ to $\bar{\mathcal{U}}$ are shown.

Each "in" node in the first cut may connect to as few as
$d - \sum_{j=1}^{i-1} \ell_j$ "out" nodes in $\mathcal{U}$ in previous stages. The sum
of edge capacities from $\mathcal{U}$ to $\bar{\mathcal{U}}$ can be as small as
$\ell_i (d - \sum_{j=1}^{i-1} \ell_j)\beta_1 + (r - \ell_i)\ell_i\beta_2$. In the second kind of cut, the
sum of edge capacities from $\mathcal{U}$ to $\bar{\mathcal{U}}$ is $\ell_i \alpha$. After taking the
minimum of these two cut values, we get

$$\ell_i \min\Big\{\alpha,\ \Big(d - \sum_{j=1}^{i-1} \ell_j\Big)\beta_1 + (r - \ell_i)\beta_2\Big\}. \qquad (2)$$

We obtain the expression in (1) by summing (2) over
$i = 1, 2, \ldots, m$ and setting $\ell_i = 0$ for $i > m$. ■

Fig. 7. Lower bound on repair-bandwidth (B = 84, d = 4, k = 4, r = 3)

A cut described in the proof of Theorem 1 is called a cut
of type $(\ell_1, \ell_2, \ldots, \ell_k)$.

We illustrate Theorem 1 by the example in Section II. The
parameters are $n = 4$, $d = k = r = 2$, $B = 4$, and
$\alpha = B/k = 2$. There are two pairs of integers $(\ell_1, \ell_2)$, namely
(2,0) and (1,1), which satisfy the condition in Theorem 1.
The capacity of the minimum cut, by Theorem 1, is no more than
$2\min\{\alpha, 2\beta_1\}$ and $\min\{\alpha, 2\beta_1 + \beta_2\} + \min\{\alpha, \beta_1 + \beta_2\}$. The
first cut imposes the upper bound $B \le 2\min\{\alpha, 2\beta_1\}$ on the
file size B, which implies that $\beta_1 \ge 1$. The second cut imposes
another constraint on B,

$$4 \le \min\{2, 2\beta_1 + \beta_2\} + \min\{2, \beta_1 + \beta_2\},$$

from which we can deduce that $\beta_1 + \beta_2 \ge 2$. After summing
$\beta_1 \ge 1$ and $\beta_1 + \beta_2 \ge 2$, we obtain $\gamma = 2\beta_1 + \beta_2 \ge 3$. The
minimum possible repair-bandwidth $\gamma = 3$ is matched by the
regenerating code presented in Section II. The regenerating
code in Section II is therefore optimal.
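
The two cut constraints of this example can also be explored by brute
force. The sketch below hard-codes the two cut values for
$\alpha = 2$ and $B = 4$ and scans a grid of candidate download and
exchange amounts using exact fractions. The grid step, the search limit
and the function names are assumptions made for illustration; a grid
scan is not the method used in the text, only a sanity check of it.

```python
from fractions import Fraction

ALPHA, B = 2, 4  # storage per node and file size in the Section II example

def cut_2_0(b1, b2):
    """Bound from the cut of type (2, 0)."""
    return 2 * min(ALPHA, 2 * b1)

def cut_1_1(b1, b2):
    """Bound from the cut of type (1, 1)."""
    return min(ALPHA, 2 * b1 + b2) + min(ALPHA, b1 + b2)

def feasible(b1, b2):
    """True if neither cut forces the file size below B."""
    return B <= cut_2_0(b1, b2) and B <= cut_1_1(b1, b2)

def best_on_grid(step=Fraction(1, 4), limit=3):
    """Smallest gamma = 2*b1 + b2 over feasible grid points in [0, limit]^2.

    Returns the tuple (gamma, b1, b2).
    """
    count = int(limit / step)
    grid = [i * step for i in range(count + 1)]
    return min(
        (2 * b1 + b2, b1, b2)
        for b1 in grid
        for b2 in grid
        if feasible(b1, b2)
    )
```

On this grid the minimum is attained at one unit downloaded from each
helper and one unit exchanged, giving a repair-bandwidth of 3.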

We can formulate the repair-bandwidth minimization problem
as follows. Given the storage per node, $\alpha$, we want
to minimize the objective function $\gamma = d\beta_1 + (r - 1)\beta_2$
over all non-negative $\beta_1$ and $\beta_2$ subject to the constraints
that the file size B is no more than the values in (1),
for all legitimate $(\ell_1, \ell_2, \ldots, \ell_k)$. It can be shown that the
minimization problem can be reduced to a linear program, and
hence can be solved efficiently. We let the resulting optimal
value be denoted by $\gamma^*(\alpha)$. This is a lower bound on repair-bandwidth
for a given value of $\alpha$.

In Fig. 7, we illustrate the lower bound $\gamma^*(\alpha)$ for $B = 84$,
$d = 4$, $k = 4$ and $r = 3$. For comparison, we also plot the
storage-repair-bandwidth tradeoff for non-cooperative one-by-one
repair in Fig. 7. From [5, Theorem 1], the smallest
repair-bandwidth of a non-cooperative minimum-storage
regenerating code is given by the formula $Bd/(k(d - k + 1))$,
which is equal to 84 in this example. It can be shown that
$\gamma^*(B/k) = B(d + r - 1)/(k(d + r - k))$. In the next section,
we give a construction of cooperative regenerating codes which
meets the lower bound $\gamma^*(B/k)$ when $d = k$.
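
As a rough numerical check of the minimum-storage point, the sketch
below evaluates the bound of Theorem 1 for every cut type and searches
integer download and exchange amounts for the smallest feasible
repair-bandwidth. Restricting the search to an integer grid is an
assumption made to keep the example short (the text reduces the problem
to a linear program instead), and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def cut_capacity(ell, d, r, alpha, beta1, beta2):
    """Value of the Theorem 1 bound for a cut of type ell = (l_1, ..., l_k)."""
    total, used = 0, 0
    for l in ell:
        total += l * min(alpha, (d - used) * beta1 + (r - l) * beta2)
        used += l
    return total

def gamma_star_grid(B, k, d, r, alpha, max_beta):
    """Smallest d*b1 + (r-1)*b2 over integers 0 <= b1, b2 <= max_beta
    such that every cut type has capacity at least B (None if infeasible)."""
    types = [ell for ell in product(range(r + 1), repeat=k) if sum(ell) == k]
    best = None
    for b1 in range(max_beta + 1):
        for b2 in range(max_beta + 1):
            if all(cut_capacity(ell, d, r, alpha, b1, b2) >= B for ell in types):
                gamma = d * b1 + (r - 1) * b2
                if best is None or gamma < best:
                    best = gamma
    return best

def gamma_star_closed_form(B, k, d, r):
    """The stated value B*(d+r-1) / (k*(d+r-k)) at alpha = B/k."""
    return Fraction(B * (d + r - 1), k * (d + r - k))
```

For B = 84, d = k = 4, r = 3 and 21 units stored per node, the grid search
and the closed-form expression both give 42.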

IV. An Explicit Construction for Exact Repair 

Exact repair has the advantage that the encoding vectors
of the newcomers remain the same. This helps in reducing
maintenance overhead. For non-cooperative and one-by-one
repair, there are several exact constructions of regenerating
codes available in the literature, for example the constructions
in [10] and [11]. In this section, we construct a family of
regenerating codes for cooperative repair with parameters
$d = k \le n - r$, which contains the example given in Section II as a
special case.

The recipe of this construction needs a maximum-distance
separable (MDS) code of length n and dimension k. Given n,
let q be the smallest prime power larger than or equal to n.
We use the Reed-Solomon (RS) code over GF(q) generated
by the following generator matrix

$$G := \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
a_1 & a_2 & a_3 & \cdots & a_n \\
a_1^2 & a_2^2 & a_3^2 & \cdots & a_n^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_1^{k-1} & a_2^{k-1} & a_3^{k-1} & \cdots & a_n^{k-1}
\end{bmatrix},$$

where $a_1, a_2, \ldots, a_n$ are n distinct elements in GF(q). Let
$\mathbf{g}_i$ be the ith column of G. Given k message symbols in GF(q),
we put them in a row vector $\mathbf{m}^T = [m_1\ m_2\ \ldots\ m_k]$. (The
superscript "T" is the transpose operator.) We encode $\mathbf{m}^T$ into
the codeword $\mathbf{m}^T G$. The MDS property of the RS code follows
from the fact that every $k \times k$ submatrix of G is a non-singular
Vandermonde matrix.
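
The MDS property can be verified exhaustively for small parameters.
The sketch below is a minimal illustration that assumes a prime field
size p, so that GF(p) is simply the integers modulo p (a general prime
power would need a proper finite-field implementation). The function
names are hypothetical.

```python
from itertools import combinations

def generator_matrix(points, k, p):
    """k x n Vandermonde generator matrix: row i holds the i-th powers of the points."""
    return [[pow(a, i, p) for a in points] for i in range(k)]

def encode(msg, G, p):
    """Codeword m^T G over GF(p)."""
    return [sum(m * g for m, g in zip(msg, col)) % p for col in zip(*G)]

def det_mod(matrix, p):
    """Determinant over GF(p), p prime, by Gaussian elimination."""
    m = [row[:] for row in matrix]
    size, det = len(m), 1
    for c in range(size):
        pivot = next((r for r in range(c, size) if m[r][c] % p), None)
        if pivot is None:
            return 0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det  # a row swap flips the sign
        det = det * m[c][c] % p
        inv = pow(m[c][c], p - 2, p)
        for r in range(c + 1, size):
            f = m[r][c] * inv % p
            m[r] = [(x - f * y) % p for x, y in zip(m[r], m[c])]
    return det % p

def is_mds(G, p):
    """True if every k x k column submatrix of G is non-singular over GF(p)."""
    k, n = len(G), len(G[0])
    return all(
        det_mod([[G[i][c] for c in cols] for i in range(k)], p) != 0
        for cols in combinations(range(n), k)
    )
```

With the seven distinct points 0, 1, ..., 6 in GF(7) and k = 4, all 35
submatrices are non-singular; repeating a point breaks the property.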

We apply the technique called "striping" from coding for 
disk arrays. The whole file of size B is divided into many 
stripes, or chunks, and each chunk of data is encoded and 
treated in the same way. In the following, we will only describe 
the operations on each stripe of data. 

We divide a stripe of data into kr packets, each of which
is regarded as an element of GF(q). The kr packets are
laid out in an $r \times k$ matrix M, called the message matrix.
To set up the distributed storage system, we first encode the
message matrix M into MG, which is an $r \times n$ matrix. For
$j = 1, 2, \ldots, n$, node j stores the r packets in the jth column of
MG. Let the r rows of M be denoted by $\mathbf{m}_1^T, \mathbf{m}_2^T, \ldots, \mathbf{m}_r^T$.
The packets stored in node j are $\mathbf{m}_i^T \mathbf{g}_j$, for $i = 1, 2, \ldots, r$.

A data collector downloads from k storage nodes, say nodes
$c_1, c_2, \ldots, c_k \in \{1, 2, \ldots, n\}$. The kr received packets are
arranged in an $r \times k$ matrix. The $(i, j)$-entry of this matrix is
$\mathbf{m}_i^T \mathbf{g}_{c_j}$. This matrix can be factorized as
$M \cdot [\mathbf{g}_{c_1}\ \mathbf{g}_{c_2}\ \cdots\ \mathbf{g}_{c_k}]$.
We can reconstruct the original file by inverting the Vandermonde
matrix $[\mathbf{g}_{c_1}\ \mathbf{g}_{c_2}\ \cdots\ \mathbf{g}_{c_k}]$.

Suppose that nodes $f_1, f_2, \ldots, f_r$ fail. The r newcomers
first coordinate among themselves, and agree upon an order
of the newcomers, say by their serial numbers. For ease of
notation, suppose that newcomer $f_j$ is the jth newcomer, for
$j = 1, 2, \ldots, r$. The jth newcomer $f_j$ connects to any
k remaining storage nodes, say $\pi_j(1), \pi_j(2), \ldots, \pi_j(k)$, and
downloads the packets encoded from $\mathbf{m}_j$, namely
$\mathbf{m}_j^T \mathbf{g}_{\pi_j(1)}, \mathbf{m}_j^T \mathbf{g}_{\pi_j(2)}, \ldots, \mathbf{m}_j^T \mathbf{g}_{\pi_j(k)}$.
(Recall that we assume $k = d$ in this construction.) Since
$[\mathbf{g}_{\pi_j(1)}\ \mathbf{g}_{\pi_j(2)}\ \cdots\ \mathbf{g}_{\pi_j(k)}]$ is non-singular,
newcomer $f_j$ can recover the message vector $\mathbf{m}_j$
after the first phase. In the second phase, newcomer $f_j$
computes $\mathbf{m}_j^T \mathbf{g}_{f_i}$ for $i = 1, 2, \ldots, r$, and sends the packet
$\mathbf{m}_j^T \mathbf{g}_{f_i}$ to newcomer $f_i$, $i \ne j$. A total of $r - 1$ packets
are sent from each newcomer in the second phase. After the
exchange of packets, newcomer $f_j$ then has the r required
packets $\mathbf{m}_i^T \mathbf{g}_{f_j}$ for $i = 1, 2, \ldots, r$. The repair-bandwidth per
newcomer is $k + r - 1 = d + r - 1$.
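
The storage layout, file reconstruction and two-phase repair can be
put together in a short simulation. The following is a sketch under
stated assumptions: the field is GF(7) represented as integers modulo
7, the evaluation points are 0, 1, ..., 6, nodes are numbered from 1,
and each newcomer simply uses the first k surviving nodes as helpers.
All function names are hypothetical.

```python
P = 7  # assumed prime field size, at least the number of nodes

def column(a, k):
    """Column of the generator matrix for evaluation point a: (1, a, ..., a^(k-1))."""
    return [pow(a, i, P) for i in range(k)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v)) % P

def solve(rows, rhs):
    """Solve rows * x = rhs over GF(P) for a square non-singular system."""
    k = len(rows)
    aug = [list(row) + [b] for row, b in zip(rows, rhs)]
    for c in range(k):
        piv = next(i for i in range(c, k) if aug[i][c] % P)
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = pow(aug[c][c], P - 2, P)
        aug[c] = [x * inv % P for x in aug[c]]
        for i in range(k):
            if i != c and aug[i][c]:
                f = aug[i][c]
                aug[i] = [(x - f * y) % P for x, y in zip(aug[i], aug[c])]
    return [aug[i][k] for i in range(k)]

def store(M, points):
    """Node j keeps column j of M*G, i.e. one packet per row of M."""
    k = len(M[0])
    return {j: [dot(row, column(a, k)) for row in M]
            for j, a in enumerate(points, start=1)}

def collect(nodes, points, chosen, k, r):
    """Data collector: rebuild M row by row from any k chosen nodes."""
    G_sub = [column(points[c - 1], k) for c in chosen]
    return [solve(G_sub, [nodes[c][i] for c in chosen]) for i in range(r)]

def cooperative_repair(nodes, points, failed, k):
    """Two-phase repair; only surviving nodes are read. Returns (rebuilt, downloaded)."""
    r = len(failed)
    alive = [j for j in nodes if j not in failed]
    downloaded = {f: 0 for f in failed}
    rows = []
    # Phase 1: the j-th newcomer downloads the row-j packet from k helpers
    # and decodes the j-th row of the message matrix.
    for idx, f in enumerate(failed):
        helpers = alive[:k]
        G_sub = [column(points[h - 1], k) for h in helpers]
        rows.append(solve(G_sub, [nodes[h][idx] for h in helpers]))
        downloaded[f] += k
    # Phase 2: each newcomer receives one re-encoded packet from each of
    # the other r - 1 newcomers (its own packet is computed locally).
    rebuilt = {}
    for f in failed:
        g = column(points[f - 1], k)
        rebuilt[f] = [dot(row, g) for row in rows]
        downloaded[f] += r - 1
    return rebuilt, downloaded
```

With n = 7, k = 4 and r = 3, each newcomer in this sketch downloads
4 + 2 = 6 packets per stripe, matching the count $k + r - 1$ in the text.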

In this construction, we can pick the smallest prime power
q larger than or equal to n as the size of the finite field. If
the number of storage nodes n increases, the finite field size
increases linearly with n.

Theorem 2. The cooperative regenerating code described
above is optimal, in the sense that if $B = kr$, $k = d$, and each
node stores $\alpha = r$ packets, the minimal repair-bandwidth per
failed node is equal to $k + r - 1$.

Proof: We use the notation as in Theorem 1. The capacity
of a cut of type $(\ell_1, \ell_2, \ldots, \ell_k)$, as shown in (1), is an upper
bound on kr. If any summand $(d - \sum_{j=1}^{i-1} \ell_j)\beta_1 + (r - \ell_i)\beta_2$
in (1) with $\ell_i > 0$ is strictly less than $\alpha = B/k = r$, then the
value in (1) is strictly less than $\sum_{i=1}^{k} \ell_i r = kr$. This would
violate the fact that kr is upper bounded by (1). Hence we
have

$$\Big(k - \sum_{j=1}^{i-1} \ell_j\Big)\beta_1 + (r - \ell_i)\beta_2 \ge B/k = r \qquad (3)$$

for any cut associated with $(\ell_1, \ell_2, \ldots, \ell_k)$ and any i with $\ell_i > 0$.

Case 1: $r \le k = d$. From a cut of type
$(\ell_1, \ell_2, \ldots, \ell_k) = (1, 1, \ldots, 1)$, taking $i = k$, we have

$$\beta_1 + (r - 1)\beta_2 \ge r \qquad (4)$$

from (3). From another cut of type
$(\ell_1, \ell_2, \ldots, \ell_k) = (1, 1, \ldots, 1, r, 0, \ldots)$, whose first $k - r$
entries are equal to 1, from (3) again we obtain the condition

$$(k - (k - r))\beta_1 + (r - r)\beta_2 = r\beta_1 \ge r,$$

which implies that $\beta_1 \ge 1$. We then add $(k - 1)\beta_1 \ge k - 1$
to (4), and get $\gamma = k\beta_1 + (r - 1)\beta_2 \ge k + r - 1$.

Case 2: $r > k = d$. Consider the two cuts associated with
$(\ell_1, \ell_2, \ldots, \ell_k)$ equal to $(k, 0, \ldots, 0)$ and $(k - 1, 1, 0, \ldots, 0)$.
We obtain the following two inequalities from (3),

$$k\beta_1 + (r - k)\beta_2 \ge r \qquad (5)$$
$$\beta_1 + (r - 1)\beta_2 \ge r. \qquad (6)$$

We multiply both sides of (5) by $(r - 1)$, and multiply both
sides of (6) by k. After adding the two resulting inequalities
and dividing by r, we get $\gamma = k\beta_1 + (r - 1)\beta_2 \ge k + r - 1$.

The repair-bandwidth per failed node therefore cannot
be less than $k + r - 1$. The repair-bandwidth of the code
constructed in this section matches this lower bound, and is
hence optimal. ■



The regenerating code constructed in this section has the
advantage that a storage node participating in a regeneration
process is required to read exactly the same amount of
data as it sends out, without any arithmetic operations. This
is called the uncoded repair property [12].

We compare below the repair-bandwidth of three different
modes of repair, all with parameters $n = 7$, $B = 84$, $k = 4$ and
$\alpha = B/4 = 21$. Suppose that three nodes fail simultaneously.

(i) Individual repair without newcomer cooperation. Each 
newcomer connects to the four remaining storage nodes. As 
calculated in the previous section, the repair-bandwidth per 
newcomer is 84. 

(ii) One-by-one repair utilizing the newly regenerated node 
as a helper. The average repair-bandwidth per newcomer is 

$$\frac{1}{3}\left(\frac{84 \cdot 4}{4(4 - 4 + 1)} + \frac{84 \cdot 5}{4(5 - 4 + 1)} + \frac{84 \cdot 6}{4(6 - 4 + 1)}\right) = 59.5.$$

The first term in the parentheses is the repair-bandwidth of
the first newcomer, which downloads from the four surviving
nodes; the second term is the repair-bandwidth of the second
newcomer, which connects to the four surviving nodes and the
newly regenerated newcomer; and so on.

(iii) Full cooperation among the three newcomers. The 
repair-bandwidth per newcomer can be reduced to 42 using 
the regenerating code given in this section. We thus see that 
newcomer cooperation is able to reduce the repair-bandwidth 
of a distributed storage system significantly. 
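
The three modes can be compared with a few lines of exact arithmetic.
The sketch below assumes that the non-cooperative minimum-storage
formula $Bd/(k(d - k + 1))$ from Section III applies to each successive
one-by-one repair with d = 4, 5 and 6 helpers; the function and variable
names are illustrative.

```python
from fractions import Fraction

def msr_bandwidth(B, k, d):
    """Non-cooperative minimum-storage repair-bandwidth B*d / (k*(d-k+1))."""
    return Fraction(B * d, k * (d - k + 1))

def cooperative_bandwidth(B, k, d, r):
    """Cooperative minimum-storage repair-bandwidth B*(d+r-1) / (k*(d+r-k))."""
    return Fraction(B * (d + r - 1), k * (d + r - k))

B, k, r = 84, 4, 3

# (i) Each newcomer repairs on its own from the four surviving nodes.
individual = msr_bandwidth(B, k, 4)

# (ii) One-by-one repair: the helper count grows from 4 to 5 to 6
# as earlier newcomers finish and join the pool of helpers.
one_by_one = sum(msr_bandwidth(B, k, d) for d in (4, 5, 6)) / r

# (iii) Full cooperation among the three newcomers with d = 4.
cooperative = cooperative_bandwidth(B, k, 4, r)
```

Under these assumptions the per-newcomer figures come to 84, 59.5 and 42
respectively.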